The Use of Person-Fit Statistics to Analyze Placement Tests

Abstract 

Person fit is a statistical index that can be used as a direct measure to assess test 
accuracy by analyzing the response pattern of examinees and identifying those who 
misfit the testing model. This misfitting is a source of inaccuracy in estimating an 
individual’s ability and it decreases the expected criterion-related validity of the test 
being used. In placement tests, where individuals are usually classified based on their 
estimated ability levels, misfitting results in misclassification of individuals, which
negatively affects the student as well as the academic organization. The study
applied person-fit statistics as estimated by the standardized log-likelihood (lz) index
to analyze two placement tests taken by college freshmen. Results supported
the use of person-fit statistics to effectively analyze placement tests. The distribution
of person-fit statistics on each test was very close to the standard normal
distribution, and both tests accurately assessed the intended student ability. Results
also showed that ability level and person-fit statistics are not related. As for the
aberrant responses resulting from misfitting persons, two common unusual responses
were observed: missing most of the items at the end of the test and missing
more easy items than expected at the beginning of the test.

Introduction 

Test scores are commonly used to assess an individual’s ability. Whether the goal of the 
test is certification, college admission, detection of specific behavior, or personnel selection, a
decision about an individual’s ability is usually made based on his or her test score. However,
test scores might be an inaccurate measure of a person’s ability; two individuals who answer the
same number of items correctly on a test may have different aptitudes and abilities. The same score can be
obtained from many different response patterns (Harnisch & Linn, 1981). For example, if one individual
answered the easiest 50% of the items on a test correctly and a second individual answered the most difficult
50% correctly, the two would get the same ability estimate (total score). Both the individual and
the organization or academic institution are negatively affected by the inaccurate estimation of
ability levels. An overestimate of an individual’s ability could decrease the organizational 
productivity while an underestimate could exclude qualified individuals from possible 
opportunities; both could decrease the criterion-related validity of the test being used (Schmitt, 
Chan, Sacco, McFarland, & Jennings, 1999). One source of inaccuracy in estimating an 
individual’s ability occurs when he/she misfits the test model (e.g., Rasch). 

When item response theory (IRT) is used as a framework for testing, it is assumed that 
the test fits an IRT model. However, even when the test as a whole fits the testing model, some 
examinees do not. That is, taking into consideration their ability level, some examinees may 
give unusual or aberrant responses. One advantage of using IRT to represent test results is to be 
able to detect and identify misfitting persons. 







Statistically, a person misfits the model if there is a difference between his/her expected 
and observed response pattern. The expected response pattern for each person is determined 
based on both the ability level (θ) on the assessed trait and the assumed IRT model. For example,
a low-ability examinee who answered difficult items correctly would misfit most test
models.

Person fit is a statistical index that can be used as a direct measure of assessment 
accuracy by analyzing response patterns and identifying persons with aberrant or unusual 
responses. It is useful to adequately understand and interpret the test results using test scores. Not 
only ability but also several psychological, cognitive, and personal factors affect an individual’s 
responses to the test items (Hambleton, Swaminathan & Rogers, 1991). The effect of these 
factors may result in aberrant responses that differ from what is expected. In this context,
person-fit statistics can be “a first step to trace persons whose answering behavior or part of it is
the result of characteristics other than the latent ability that the test intends to measure” (Meijer
& Sijtsma, 1994, p. 7).

In general, many kinds of aberrant or unusual responses are possible; however, only the common ones have
been identified or described in the literature. Meijer (1996) described some of them assuming a
multiple-choice test with binary scoring. For example, sleeping behavior is observed when an
examinee has trouble beginning the test (slow to warm up); guessing behavior occurs when a low-ability
person answers difficult items correctly by guessing blindly; cheating behavior can be the
case when a low-ability student gets the most difficult items correct by copying the answers from a
more competent neighbor; plodding behavior results from working slowly and not moving on to the
next item; and alignment error happens, for example, when a high-ability student skips an item in the
test but forgets to skip it on the answer sheet. Other aberrant responses might be due to fatigue,
unfamiliarity with the topic or the test format (Swearingen, 1998), and scoring errors (Hulin,
Drasgow, & Parsons, 1983).

If the test fits the model then it accurately represents the relationship between the ability level 
and the responses of the examinee. In other words, the test scores are accurate measures of an 
examinee’s ability on the construct being assessed. Conversely, if the person misfits the model, then 
the accuracy of estimating his/her ability is negatively affected. Consequently, the decision made 
based on this assessment is inaccurate and this, as mentioned earlier, negatively affects the individual 
as well as the organization. 

Purpose 

Placement tests are commonly used in classifying persons in different levels or categories 
based on their ability level on some construct(s). Taking into consideration the fact that 
placement tests are usually followed by a decision regarding individuals, and that an increasing 
number of persons are taking placement tests for different purposes, an effective and accurate 
procedure to evaluate the accuracy of assessment and classification in placement tests is 
necessary. Person-fit statistics can be used to analyze response patterns and to identify persons
with unusual responses. Hence, they can be used to assess the accuracy of placement tests.

The purpose of this study was to investigate the use of person fit in evaluating the 
accuracy of placement tests. More specifically, the analysis and evaluation of a placement test 
was accomplished through addressing the following four research questions, with respect to the 
placement tests used: 

1. What is the distribution of person-fit statistics?

2. What percentage of persons misfit the assumed IRT model?

3. Is there a relationship between person-fit statistics and ability level? 

4. What are the common aberrant responses? 

Research on Person Fit 

In the literature of person fit, few studies have been conducted in actual field 
applications. In this regard, Rudner (1995) stated that “although the need for person-fit statistics 
has been documented and uses for it have been suggested, for the most part, it has not yet been 
applied to many settings” (p. 22). Meijer (1997) investigated the effect of model misfit on test
validity in general and on criterion-related validity in particular. Meijer used simulated data that
fit the three-parameter logistic model (3PL) with four manipulated factors: number of items in
the test, size of the correlation between the predictor test and the criterion test, proportion of
nonfitting persons, and type of misfit. Results showed that the test validity decreased
substantially when the type of misfit was severe or the correlation between the predictor and the
criterion was high. A similar result was obtained when at least 15% of persons in the test misfit
the model. 

Yoes & Ho (1991) investigated the degree of person misfit on a national standardized 
achievement test. Three person-fit indicators (the unweighted standardized mean square, the
standardized weighted mean square (INFIT), and the standardized likelihood index (lz)) were
used to identify the percentage of misfitting persons. The study used three subtests (science,
reading comprehension, and spelling) of the Stanford Achievement Test. Results indicated that
the percentage of students misfitting the model (Rasch) was small. Results also supported the use
of the lz and INFIT indices in detecting aberrant responses as compared with the unweighted
standardized mean square index. 

Rudner (1995) used person-fit statistics in reporting and analyzing the results of the 
National Assessment of Educational Progress (NAEP). Using the IRT model-fit mean square 
statistics on data from the 1990 and 1992 NAEP assessment of mathematics, Rudner found that 
the data fit was good with very few abnormal responses (overall fit mean = .91 and standard
deviation = .017). When the analysis was blocked by state, fit was also good, and the same results
were obtained for fit by race, community, and gender. In addition, no relationship was observed
between person-ability and person-fit statistics (correlations ranged from -.20 to .17).

Fit statistics have been applied in the investigation of a few specific educational issues, for
example, gender and race differences and the use of different instructional methods. Frary (1982)
applied four different fit indices to a large sample of eighth-grade students and found
that males and whites were more likely to show aberrant responses than females and blacks,
respectively. Tatsuoka & Tatsuoka (1982) used two different methods of instruction to teach
addition of signed numbers to two equal groups of students and then calculated person-fit
statistics for each. Results of this study showed that the difference between the two groups
was significant. In another application, fit statistics were successfully used to identify schools whose
curricula did not match the test content (Harnisch & Linn, 1981).

Method 

Data 

Person-fit analysis was applied to two placement tests administered to college freshmen
at the United Arab Emirates University (UAEU): the English Placement Test (EPT)
and the Arabic Placement Test (APT). Both tests are required of all freshmen and are
administered yearly to measure students’ ability and place them into levels of English or Arabic study. Data
from the 2001-2002 administration of the EPT were used. The sample consisted of 1558 students
(1214 females and 344 males) who responded to 119 multiple-choice items. The test had an
internal reliability of .92. The data used for the APT were from the 2000-2001 administration. The
test consisted of 90 multiple-choice items with an internal reliability of .84. The sample size was
1077 students (757 females and 320 males). Neither test was speeded (more than 95% of
students completed all items of each test).

Person-Fit Index 

Person-fit indices are known by different names: “appropriateness measurement”,
“response aberrancy”, “scalability”, “individual consistency”, and “norm conformity”. Some
person-fit indices are person-group fit indices and others are person-model fit indices. Some indices are
based on Classical Test Theory (CTT) while others are used under IRT models (see Meijer &
Sijtsma, 2001, for an extensive review of common person-fit indices). In the IRT framework, a
person-fit index estimates the consistency of an individual’s responses given his/her ability level
(θ). Several person-fit indices have been derived and used. The standardized log-likelihood
person-fit statistic (lz) is one of the most accurate (Drasgow & Levine, 1986; Li & Olejnik,
1997; Nering, 1997; Nering & Meijer, 1998; Reise & Due, 1991; Yoes & Ho, 1991). The lz statistic is the
standardized form of the log-likelihood l0, which is given by the following formula (Schmitt, Cortina, &
Whitney, 1993):

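$$ l_0 = \sum_{i=1}^{n} \left\{ u_i \ln P_i(\theta) + (1 - u_i) \ln\left[1 - P_i(\theta)\right] \right\}, $$

where $n$ is the number of items in the test, $u_i$ is the person's response to the $i$th item, and
$P_i(\theta)$ is the probability of a correct response to item $i$ given the person's ability level $\theta$. The lz statistic is obtained
by standardizing $l_0$ as follows:

$$ l_z = \frac{l_0 - E(l_0)}{\left[\mathrm{Var}(l_0)\right]^{1/2}}, $$

where $E(l_0)$ is the expected value of $l_0$ and $\mathrm{Var}(l_0)$ is the variance of $l_0$. $E(l_0)$ is given by the
following formula:

$$ E(l_0) = \sum_{i=1}^{n} \left\{ P_i(\theta) \ln P_i(\theta) + \left[1 - P_i(\theta)\right] \ln\left[1 - P_i(\theta)\right] \right\}, $$

while $\mathrm{Var}(l_0)$ is given by the formula:

$$ \mathrm{Var}(l_0) = \sum_{i=1}^{n} P_i(\theta)\left[1 - P_i(\theta)\right] \left\{ \ln \frac{P_i(\theta)}{1 - P_i(\theta)} \right\}^{2}. $$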

According to Drasgow & Levine (1986), lz has a distribution close to the standard
normal distribution, with a mean of 0 and a standard deviation of 1, at all θ levels. The lz statistic is not
recommended for short tests of fewer than 20 items (Reise & Due, 1991) unless highly
discriminating items are used (Meijer & Sijtsma, 1994). Large absolute lz values (greater than +2 or less
than -2) are usually taken to indicate problematic response patterns (Ferrando & Lorenzo, 2000).
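
The lz computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the WPerfit implementation: the function name `lz_statistic` and the ten model probabilities are hypothetical, and the probabilities are assumed to have been obtained already from a calibrated IRT model at the person's estimated θ.

```python
import math


def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic (lz).

    responses: list of 0/1 item scores u_i
    probs: list of model probabilities P_i(theta), each strictly between 0 and 1
    """
    l0 = 0.0        # observed log-likelihood of the response pattern
    expected = 0.0  # E(l0)
    variance = 0.0  # Var(l0)
    for u, p in zip(responses, probs):
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)


# Hypothetical probabilities of a correct response, easiest item first
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]

consistent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # right on easy items, wrong on hard ones
aberrant = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]    # wrong on easy items, right on hard ones

print(lz_statistic(consistent, probs))
print(lz_statistic(aberrant, probs))
```

Both patterns have the same total score of 5, yet the model-consistent pattern yields a positive lz while the reversed pattern yields a value far below -2, which is the kind of response pattern the index is meant to flag.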

Procedure 

A classical item analysis was conducted first on each test to determine the range of the
item discrimination indices and the mean item difficulty. This analysis is usually
conducted to determine the appropriate IRT model to represent the data (Hambleton et al.,
1991). Data from each test were calibrated using BILOG (Mislevy & Bock, 1990), which uses the
marginal maximum likelihood algorithm. The number and percentage of items misfitting the
assumed model were used as another indicator of model-data fit. To clean up the data and to
neutralize the effect of item misfit on the analysis of person fit, misfitting items were first
identified and removed from the analysis. Items were then recalibrated, and the item parameters
(difficulty, discrimination, and guessing) and persons’ ability parameters (θ) were estimated and
used in the WPerfit program (Ferrando & Lorenzo, 2000) to calculate the person-fit lz value for
each person.

Results 

Unidimensionality

Unidimensionality is the most important assumption underlying the use of IRT as a framework for
testing. A test is said to be unidimensional if its items measure only one trait or ability. In 
practice, it is difficult to meet this assumption absolutely as many factors affect test performance. 
Factor analysis is commonly used to assess test unidimensionality. Reckase (1979) suggested 
that acceptable IRT parameter estimation could be obtained if the first factor accounts for at least 
20% of the variance. In this study the assumption was checked through the factor analysis of the 
inter-item tetrachoric correlations calculated by the program TESTFACT. It was found that the
first factor for the EPT explained 23.5% of the variance and the second factor explained only 
6.2%. For the APT, the first factor explained 28% while the second factor explained only 5.5% 
of the variance. Based on these results, it was reasonable to conclude that the unidimensionality 
assumption for the IRT models held for the two data sets. 
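
As a rough illustration of the 20% criterion, the share of variance attributable to the first factor can be approximated by the largest eigenvalue of the inter-item correlation matrix divided by the number of items. The sketch below uses plain power iteration on a small hypothetical correlation matrix. It is a principal-components approximation, not the factor-analytic procedure implemented in TESTFACT, and the function names are illustrative.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Approximate the largest eigenvalue of a symmetric matrix by power iteration.

    Assumes the starting vector is not orthogonal to the dominant eigenvector,
    which holds for correlation matrices with positive entries.
    """
    n = len(matrix)
    vec = [1.0 / n ** 0.5] * n  # unit-length starting vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vec = [x / norm for x in product]
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    # Rayleigh quotient of the unit-length vector
    return sum(v * p for v, p in zip(vec, product))


def first_factor_share(corr):
    """Share of total variance carried by the first principal component."""
    return largest_eigenvalue(corr) / len(corr)


# Hypothetical 4-item correlation matrix (all inter-item correlations .30)
corr = [
    [1.0, 0.3, 0.3, 0.3],
    [0.3, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]

share = first_factor_share(corr)
meets_criterion = share >= 0.20  # rule of thumb attributed to Reckase (1979) in the text
print(round(share, 3), meets_criterion)
```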

Model-data fit 

Results of the classical item analysis for the two tests are shown in Table 1. The ranges of
the item discrimination values for the EPT and APT were .45 and .27, respectively. This large
variation in the item discrimination values indicated that the 1PL model, which assumes equal
discrimination values, was inappropriate to represent either of these data sets. In addition, as can be seen in Table 1, the mean item
difficulties for both tests were relatively low (.38 for the EPT and .54 for the APT). Because classical item
difficulty is the proportion of correct responses, this means that the
items were not particularly easy. If items are not easy, then guessing is likely to occur when
answering questions, especially multiple-choice ones (Hambleton et al., 1991). Since the 3PL
accounts for both guessing and variation in discrimination values, it was more appropriate to
represent these data sets.
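
The three models compared here differ only in which item parameters are free. A minimal sketch of the 3PL item response function follows; fixing the guessing parameter c at 0 gives the 2PL, and additionally holding the discrimination a equal across items gives the 1PL. The scaling constant 1.7 and all parameter values are assumptions for illustration, not estimates from the EPT or APT.

```python
import math


def prob_correct(theta, a, b, c=0.0, scale=1.7):
    """Probability of a correct response under the 3PL model.

    theta: person ability; a: discrimination; b: difficulty; c: guessing.
    scale is the conventional logistic scaling constant (assumed here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# Hypothetical hard item (b = 1.5) answered by a low-ability examinee (theta = -2)
p_2pl = prob_correct(theta=-2.0, a=1.0, b=1.5)          # no guessing allowed
p_3pl = prob_correct(theta=-2.0, a=1.0, b=1.5, c=0.25)  # hypothetical guessing floor
print(round(p_2pl, 3), round(p_3pl, 3))
```

Without a guessing parameter, the model treats a correct answer from this examinee as nearly impossible; with it, the probability never drops below c, which is why the 3PL is better suited to hard multiple-choice items.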



Insert Table 1 about here 



The last three columns of Table 1 show the numbers and percentages of misfitting items
for each IRT model. These values can be used as indicators of model-data fit. As was
expected from the classical item analysis, the 3PL was better than the other
two models in representing the data sets. For example, using the 1PL, 85 and 46 items misfit the
model in the EPT and APT, respectively. The numbers of misfitting items decreased
substantially with the 3PL (23 and 12 items for the EPT and APT, respectively). This result
supported the use of the 3PL to represent the data of both tests.

Based on both the classical item analysis and the analysis of model-data fit, the 3PL was 
found to better fit the two data sets. To neutralize the effect of misfitting items on the analysis of 
person-fit statistics, all misfitting items in each test were deleted from the final calibration. This 
resulted in deleting 23 items from the EPT and 12 items from the APT. 
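
The percentages in Table 1 and the final test lengths follow directly from the reported counts. The short check below recomputes them from the numbers given in the text; only the variable names are new.

```python
# Item counts and misfitting-item counts as reported for each test
tests = {
    "EPT": {"items": 119, "misfit": {"1PL": 85, "2PL": 53, "3PL": 23}},
    "APT": {"items": 90, "misfit": {"1PL": 46, "2PL": 22, "3PL": 12}},
}

summary = {}
for name, info in tests.items():
    # Percentage of items misfitting each model, rounded to whole percent
    percentages = {
        model: round(100 * count / info["items"])
        for model, count in info["misfit"].items()
    }
    # Items kept for the final 3PL calibration after dropping misfitting items
    remaining = info["items"] - info["misfit"]["3PL"]
    summary[name] = {"percent_misfit": percentages, "final_items": remaining}
    print(name, percentages, remaining)
```

The resulting final lengths, 96 items for the EPT and 78 for the APT, match the item counts reported in Table 2.
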

In terms of the first research question, results showed that the distribution of person-fit
statistics closely followed the normal curve, with a mean and standard deviation close to the expected values of 0 and
1 (Table 2). The mean of the person-fit statistics for the EPT was .17 with a standard
deviation of .99, while the mean for the APT was very close to 0 (mean = .02) with a
standard deviation of .92. The skewness and kurtosis values did not differ significantly from those of a
normal distribution at the .05 level.



Insert Table 2 about here 



The second research question addressed the percentage of misfitting persons. The
number and percentage of misfitting persons on each test were very small. Only 75 (4.8%)
and 33 (3.1%) persons misfit the test model in the EPT and APT, respectively. These percentages were
no greater than roughly what one might expect by chance alone. The small number of misfitting persons indicated
that the two placement tests had accurately assessed students’ ability. If students were to
be classified based on these results, then the overwhelming majority of them would be accurately
classified.
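
For reference, the share of examinees expected to fall beyond a cutoff by chance alone can be computed from the standard normal distribution. The sketch below assumes that lz is exactly standard normal and that the ±2 cutoff mentioned earlier is the flagging rule; the rule actually applied by WPerfit is not stated in this section, so both are assumptions.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chance_rate(cutoff):
    """Two-sided tail probability beyond +/- cutoff."""
    return 2.0 * (1.0 - normal_cdf(cutoff))


# Misfitting persons and sample sizes as reported in the text
observed = {"EPT": 75 / 1558, "APT": 33 / 1077}
expected = chance_rate(2.0)  # assumed cutoff of +/- 2

for name, rate in observed.items():
    print(f"{name}: observed {rate:.1%}, chance benchmark {expected:.1%}")
```

Under these assumptions the benchmark is roughly 4.6%, so the observed rates of 4.8% and 3.1% are close to or below it.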

The third research question addressed the relationship between ability (θ) and person-fit
statistics. Because some person-fit values were positive and others were negative,
absolute person-fit values were used in calculating the correlations. Results showed that person-fit
statistics and ability were essentially uncorrelated (correlations were .002 and .13 for the EPT and
APT, respectively). In addition, the relationship between person-fit statistics and student college
GPA was checked, and no correlation was found between the two variables (the correlations were
-.01 and .02 for the EPT and APT, respectively).
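
The choice between signed and absolute person-fit values matters because the two can give very different correlations with ability. The sketch below computes a Pearson correlation both ways on a small, deliberately contrived data set; the θ and lz values are hypothetical and are not taken from the EPT or APT.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ability estimates and lz values for five examinees
theta = [-2.0, -1.0, 0.0, 1.0, 2.0]
lz = [-2.5, 0.3, 0.1, -0.2, 2.4]

r_signed = pearson(theta, lz)
r_absolute = pearson(theta, [abs(value) for value in lz])
print(round(r_signed, 2), round(r_absolute, 2))
```

In this contrived example the signed values correlate strongly with θ, while the absolute values, which capture the size of the misfit regardless of its direction, are nearly uncorrelated with it.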

The fourth research question addressed the identification of the common aberrant responses 
that appeared in each test. To save space, only the aberrant responses on the EPT were analyzed. 
Generally, giving an explanation for every specific unusual response is not a straightforward task,
for two reasons. First, different unusual behaviors can produce similar aberrant
responses. For example, for a given test it may be difficult to decide whether a particular
aberrant response is due to cheating or guessing behavior. Second, although unusual behavior is
likely to produce an unusual response, this is not always the case; for example, a person who guessed
on a test and got some items correct but others wrong might not misfit any IRT model (Meijer
& Sijtsma, 2001).

The responses of all misfitting persons (75 in the EPT) were analyzed first using the original 
order of the items as presented in the test. Items were then rank-ordered based on their difficulty
level, and the responses of the misfitting persons were analyzed again. In addition, all the aberrant 
responses were checked using the graphical plots offered by the WPerfit Program. 

Two common unusual responses were observed in the EPT. The first was missing
most of the items at the end of the test (the last 30% of the items), more than what was expected based
on the ability level. This unusual response was demonstrated by 19 students. A possible
explanation for this kind of misfit is fatigue. The test was long (119 items), which probably
negatively affected students’ motivation and/or concentration toward the end of the test. In
addition, UAE students have little experience with multiple-choice tests, and their high
school tests are usually much shorter. The second was missing more easy items than expected at
the beginning of the test, but then answering correctly many more items than expected. Fifteen
students demonstrated responses similar to this one. A possible explanation for this unusual
response is either cheating or guessing. Given that these tests are administered under strict
conditions with a maximum penalty for cheating, guessing was very likely the cause of this
response.
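
One way to screen for the first pattern is to compare the number of items answered correctly in the final 30% of the test with the number expected from the model. The sketch below is a hypothetical screening aid, not the procedure used in this study (which relied on inspecting response patterns and the WPerfit plots); the function name, the standardized residual, and the example data are all assumptions.

```python
import math


def tail_residual(responses, probs, tail_fraction=0.30):
    """Standardized difference between observed and expected correct answers
    on the last tail_fraction of the items (in administration order)."""
    n = len(responses)
    start = n - int(round(n * tail_fraction))
    observed = sum(responses[start:])
    expected = sum(probs[start:])
    variance = sum(p * (1.0 - p) for p in probs[start:])
    return (observed - expected) / math.sqrt(variance)


# Hypothetical examinee: 10 items, each with model probability .70 of success,
# who misses every item in the last 30% of the test
probs = [0.7] * 10
responses = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

z_tail = tail_residual(responses, probs)
print(round(z_tail, 2))
```

A strongly negative value indicates far fewer correct answers near the end of the test than the examinee's ability would predict, which is consistent with, but does not prove, a fatigue explanation.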

Discussion 

This study applied the theoretically well-known person-fit statistics to the analysis of
placement tests. Person-fit statistics have been used as an accurate means of identifying aberrant
responses, which in turn can serve as a measure of assessment accuracy. Placement tests are
commonly used at different school levels and for different purposes. At the college level, 
placement tests are used to classify admitted students into studying levels. Misclassification of 
students based on the results of placement tests negatively affects both students and colleges. 
Taking into consideration the wide use of placement tests and the large number of examinees 
who take such tests, the accuracy of the assessment of these tests becomes a significant issue. 

In both placement tests used in this study, the distribution of person-fit statistics was
very close to the standardized normal distribution. This result supports the use of person-fit 
statistics to effectively analyze placement tests. This analysis includes evaluating the assessment 
accuracy of a test and identifying the aberrant responses. Evaluating the assessment accuracy can 
result in improving the test itself (e.g., using different formatting, changing the test length, or 
using another assessment tool). For example, if it is observed that blind guessing mainly causes 
person misfit, then using item formatting that reduces guessing (e.g., essay questions) can be a 
good alternative. Another example is shortening the test if it is found that fatigue causes the 
misfit. As for aberrant responses, the identification of misfitting persons, along with their number and
percentage, allows the organization to estimate the “size” and the effect of inaccurate estimation.

With regard to the relationship between ability and person-fit statistics, results show that 
there is no relationship between the two variables. This result is consistent with the result of 
Rudner (1995). The absence of a relationship between ability and person-fit indicates that there 
are psychological and personal factors other than ability responsible for misfitting. For example, 
level of motivation, level of test anxiety, tendency to guess (Hambleton et al., 1991), tendency
to cheat, attitude toward the topic, level of attentiveness, and sincerity (Green, 1996) affect
examinees’ responses. This supports the use of person-fit statistics to identify examinees affected
by such factors, which cannot be achieved using test scores alone. However, how such psychological and
personal factors affect person-fit statistics has not been investigated yet. Further research needs 
to be directed toward studying the effect of these factors on person-fit statistics. 

Identifying the number or the percentage of misfitting persons is essential in the 
evaluation of a test and its accuracy. However, a comprehensive understanding and evaluation of 
the test may not be accomplished without identifying the common aberrant responses 
demonstrated by misfitting persons and the best interpretation of each. 

Table 1

Classical Item Analysis of EPT and APT and Misfitting Items

                     Mean of item    Range of item   Mean of item   Number and percentage of misfitting items
Test   No. of items  discrimination  discrimination  difficulty     1PL        2PL        3PL
EPT    119           .39             .45             .38            85 (71%)   53 (45%)   23 (19%)
APT    90            .26             .27             .54            46 (51%)   22 (24%)   12 (13%)

Note. EPT: English Placement Test; APT: Arabic Placement Test.

Table 2

Distribution of Person-fit Statistics

Test   Final no. of items   Mean   SD    KS    Skewness   Kurtosis
EPT    96                   .17    .99   .03   -.30       .08
APT    78                   .02    .92   .02   -.16       .09

Note. EPT: English Placement Test; APT: Arabic Placement Test;
KS: Kolmogorov-Smirnov Test of Normality.